Db2 Statistical Functions

Db2 already has a variety of Statistical functions built in. In Db2 11.1, a number of new functions have been added including:

COVARIANCE_SAMP - The COVARIANCE_SAMP function returns the sample covariance of a set of number pairs
STDDEV_SAMP - The STDDEV_SAMP column function returns the sample standard deviation (division by [n-1]) of a set of numbers.
VARIANCE_SAMP or VAR_SAMP - The VARIANCE_SAMP column function returns the sample variance (division by [n-1]) of a set of numbers.
CUME_DIST - The CUME_DIST column function returns the cumulative distribution of a row that is hypothetically inserted into a group of rows
PERCENT_RANK - The PERCENT_RANK column function returns the relative percentile rank of a row that is hypothetically inserted into a group of rows.
PERCENTILE_DISC, PERCENTILE_CONT - Returns the value that corresponds to the specified percentile given a sort specification by using discrete (DISC) or continuous (CONT) distribution
MEDIAN - The MEDIAN column function returns the median value in a set of values
WIDTH_BUCKET - The WIDTH_BUCKET function is used to create equal-width histograms

Sampling Functions

The traditional VARIANCE, COVARIANCE, and STDDEV functions have been available in Db2 for a long time. When computing these values, the formulae assume that the entire population has been counted (N). The traditional formula for standard deviation is:

$$\sigma=\sqrt{\frac{1}{N}\sum_{i=1}^N(x_{i}-\mu)^{2}}$$

N refers to the size of the population and in many cases, we only have a sample, not the entire population of values. In this case, the formula needs to be adjusted to account for the sampling.

$$s=\sqrt{\frac{1}{N-1}\sum_{i=1}^N(x_{i}-\bar{x})^{2}}$$

Set up the connection to the database.



In [ ]:

    
%run db2.ipynb

We populate the database with the EMPLOYEE and DEPARTMENT tables so that we can run the various examples.



In [ ]:

    
%sql -sampledata

COVARIANCE_SAMP

The COVARIANCE_SAMP function returns the sample covariance of a set of number pairs.



In [ ]:

    
%%sql
SELECT COVARIANCE_SAMP(SALARY, BONUS) 
  FROM EMPLOYEE 
WHERE WORKDEPT = 'A00'

STDDEV_SAMP

The STDDEV_SAMP column function returns the sample standard deviation (division by [n-1]) of a set of numbers.



In [ ]:

    
%%sql
SELECT STDDEV_SAMP(SALARY) 
  FROM EMPLOYEE 
WHERE WORKDEPT = 'A00'

VARIANCE_SAMP

The VARIANCE_SAMP column function returns the sample variance (division by [n-1]) of a set of numbers.



In [ ]:

    
%%sql
SELECT VARIANCE_SAMP(SALARY) 
  FROM EMPLOYEE 
WHERE WORKDEPT = 'A00'

MEDIAN

The MEDIAN column function returns the median value in a set of values.



In [ ]:

    
%%sql
SELECT MEDIAN(SALARY) AS MEDIAN, AVG(SALARY) AS AVERAGE 
  FROM EMPLOYEE 
WHERE WORKDEPT = 'E21'

CUME_DIST

The CUME_DIST column function returns the cumulative distribution of a row that is hypothetically inserted into a group of rows.



In [ ]:

    
%%sql
SELECT CUME_DIST(47000) WITHIN GROUP (ORDER BY SALARY) 
  FROM EMPLOYEE 
WHERE WORKDEPT = 'A00'

PERCENT_RANK

The PERCENT_RANK column function returns the relative percentile rank of a row that is hypothetically inserted into a group of rows.



In [ ]:

    
%%sql
SELECT PERCENT_RANK(47000) WITHIN GROUP (ORDER BY SALARY) 
  FROM EMPLOYEE 
WHERE WORKDEPT = 'A00'

PERCENTILE_DISC

The PERCENTILE_DISC/CONT returns the value that corresponds to the specified percentile given a sort specification by using discrete (DISC) or continuous (CONT) distribution.



In [ ]:

    
%%sql
SELECT PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER BY SALARY) 
  FROM EMPLOYEE 
WHERE WORKDEPT = 'E21'

PERCENTILE_CONT

This is a function that gives you a continuous percentile calculation.



In [ ]:

    
%%sql
SELECT PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY SALARY) 
  FROM EMPLOYEE 
WHERE WORKDEPT = 'E21'

WIDTH BUCKET and Histogram Example

The WIDTH_BUCKET function is used to create equal-width histograms. Using the EMPLOYEE table, This SQL will assign a bucket to each employee's salary using a range of 35000 to 100000 divided into 13 buckets.



In [ ]:

    
%%sql
SELECT EMPNO, SALARY, WIDTH_BUCKET(SALARY, 35000, 100000, 13) 
  FROM EMPLOYEE 
ORDER BY EMPNO

We can plot this information by adding some more details to the bucket output.



In [ ]:

    
%%sql -a
WITH BUCKETS(EMPNO, SALARY, BNO) AS 
  ( 
  SELECT EMPNO, SALARY, 
    WIDTH_BUCKET(SALARY, 35000, 100000, 9) AS BUCKET 
  FROM EMPLOYEE ORDER BY EMPNO 
  ) 
SELECT BNO, COUNT(*) AS COUNT FROM BUCKETS 
GROUP BY BNO 
ORDER BY BNO ASC

And here is a plot of the data to make sense of the histogram.



In [ ]:

    
%%sql -pb
WITH BUCKETS(EMPNO, SALARY, BNO) AS 
  ( 
  SELECT EMPNO, SALARY, 
    WIDTH_BUCKET(SALARY, 35000, 100000, 9) AS BUCKET 
  FROM EMPLOYEE ORDER BY EMPNO 
  ) 
SELECT BNO, COUNT(*) AS COUNT FROM BUCKETS 
GROUP BY BNO 
ORDER BY BNO ASC